25 research outputs found

    Hierarchical Analysis of Differential Spectral Envelopes of an Emotional Voice

    This paper describes a new method for analysing vocal timbre through the study of the spectral-envelope variations used by the same speaker in neutral versus expressive emotional conditions. The analysis is based on a single-speaker corpus in which the speaker was instructed to read a set of sentences first in a neutral reading style and then in two emotional modes: a happy style and a sad style. The spectral envelopes of the time-aligned neutral and expressive (happy and sad) realisations are compared using a differential method. The differences were computed between the emotional and the neutral state, so the two categories compared are neutral-happy and neutral-sad. Statistics of the differential envelopes were computed for each phone. The data were examined using an agglomerative hierarchical clustering method. The resulting clusters are validated with several distance measures between statistical distributions and explored visually to find similarities and differences between the two categories. The results highlight systematic variations in vocal timbre associated with the two sets of spectral-envelope differences. These traits depend on the valence of the emotion considered (positive, negative) as well as on the phonetic properties of the particular phone, such as voicing and place of articulation.

    Cluster Analysis of Differential Spectral Envelopes on Emotional Speech

    This paper reports on the analysis of the spectral variation of emotional speech. Spectral envelopes of time-aligned speech frames are compared between emotionally neutral and active utterances. Statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are classified using agglomerative hierarchical clustering with a measure of dissimilarity between statistical distributions, and the resulting clusters are analysed. The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds, such as voicing and place of articulation.
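
    The pipeline described in this abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the envelope estimator, the frame alignment, and the Euclidean/average-linkage dissimilarity are all assumptions standing in for the paper's actual choices.

    ```python
    # Toy sketch: differential spectral envelopes per phoneme, then
    # agglomerative hierarchical clustering of the per-phoneme statistics.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)

    def spectral_envelope(frame, n_bins=64):
        """Crude envelope estimate: log-magnitude spectrum of a windowed frame."""
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=2 * n_bins - 2))
        return np.log(spec + 1e-8)

    # Toy data: one pair of time-aligned frame sequences per phoneme
    # (in the paper these would come from aligned neutral/emotional readings).
    phonemes = ["a", "e", "s", "t"]
    stats = []
    for ph in phonemes:
        neutral = rng.standard_normal((20, 256))   # 20 aligned frames
        happy = rng.standard_normal((20, 256))
        diff = np.array([spectral_envelope(h) - spectral_envelope(n)
                         for n, h in zip(neutral, happy)])
        # Per-phoneme statistic of the differential envelopes: mean over frames.
        stats.append(diff.mean(axis=0))

    # Agglomerative clustering of the per-phoneme statistics (Euclidean distance
    # with average linkage stands in for the distribution-distance measure).
    Z = linkage(np.array(stats), method="average", metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(dict(zip(phonemes, labels)))
    ```

    On real data, inspecting which phonemes fall into the same cluster is what reveals the valence- and articulation-dependent patterns the abstract mentions.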

    Two Vocoding Techniques for the Synthesis of Emotional Speech through Voice Timbre Transformation

    This paper describes two techniques for modifying vocal timbre, used in a voice transformation experiment aimed at reproducing some characteristics of emotional speech. The speech signal produced by a speaker in a neutral reading style is converted so as to reproduce the spectral envelope used by the same speaker in a non-neutral emotional condition. The conversion function between spectral envelopes is computed with a method trained on real data. For this purpose, a database was used containing the voice of a speaker recorded while reading/acting a corpus of texts in different emotional styles: happy, sad, and a neutral reference style. The two waveform-generation (vocoding) techniques considered are the Phase Vocoder and the MLSA (Mel Log Spectrum Approximation) filter. The two implemented prototypes were evaluated with perceptual tests, while objective evaluations confirmed the effectiveness of the conversion function.
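
    A minimal sketch of the core idea of a trained spectral conversion function, under strong assumptions: here it is reduced to a least-squares linear map on envelope parameter vectors, whereas the paper's trained method and its parameterisation may differ entirely.

    ```python
    # Hypothetical sketch: learn a neutral -> emotional envelope conversion
    # from paired, time-aligned training frames via least squares.
    import numpy as np

    rng = np.random.default_rng(1)
    dim, n_frames = 24, 500

    # Paired envelope parameters (e.g. mel-cepstra), neutral vs. happy.
    neutral = rng.standard_normal((n_frames, dim))
    true_A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
    happy = neutral @ true_A.T + 0.05 * rng.standard_normal((n_frames, dim))

    # "Training on real data": fit the linear conversion by least squares.
    A, *_ = np.linalg.lstsq(neutral, happy, rcond=None)

    # Apply to an unseen neutral frame; the converted envelope would then drive
    # a vocoder (Phase Vocoder or MLSA filter) to re-synthesise the waveform.
    x = rng.standard_normal(dim)
    converted = x @ A
    print(converted.shape)  # (24,)
    ```

    The vocoding stage itself (Phase Vocoder or MLSA synthesis) is deliberately left out: it is the part the paper compares, and it cannot be reduced to a few lines.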

    Sentiment Analysis as a Tool for Studying Emotional Speech?

    Abundant literature has shown that emotional speech is characterized by various acoustic cues. However, most studies have focused on sentences produced by actors, disregarding ecologically elicited speech due to difficulties in finding suitable emotional data. In this contribution we explore the possibility of using sentiment analysis for the selection of emotional chunks from speech corpora. We used the LibriSpeech corpus and extracted sentiment analysis scores at word and sentence levels, as well as several acoustic and spectral parameters of emotional voice. The analysis of the relation between textual and acoustic indices revealed significant but small effects. This suggests that these two levels tend to be fairly independent, making it improper to use sentiment analysis for the selection of acoustically emotional speech.
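
    The kind of text-vs-acoustics comparison described here can be illustrated with a toy correlation analysis. Everything below is an assumption for illustration only: the feature (mean F0), the sentiment scale, and the data are synthetic stand-ins, not the study's pipeline.

    ```python
    # Toy illustration: correlate sentence-level sentiment scores with an
    # acoustic index (here, mean F0) and observe a weak relationship.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 200

    sentiment = rng.uniform(-1, 1, n)  # hypothetical per-sentence sentiment score
    # Simulate a weak textual-acoustic link drowned in speaker variability.
    f0_mean = 120 + 2.0 * sentiment + rng.standard_normal(n) * 15

    # Pearson correlation between the textual and the acoustic level.
    r = np.corrcoef(sentiment, f0_mean)[0, 1]
    print(f"r = {r:.2f}")
    ```

    A small |r| on real data is exactly the "significant but small effects" pattern that leads the authors to treat the lexical and acoustic levels as largely independent.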

    An Experimental Review of Speaker Diarization Methods with Application to Two-Speaker Conversational Telephone Speech Recordings

    We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied the inference-time computational requirements and diarization accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off between computing requirements and performance. More generally, EEND models were found to be lighter and faster at inference than clustering-based methods. However, they also require a large amount of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced, whereas self-attentive EEND (SA-EEND) was less affected. We also found that SA-EEND gives less consistent results across datasets than EEND-VC, with its performance degrading on long conversations with high speech sparsity. Clustering-based diarization systems, and in particular VBx, instead have more consistent performance than SA-EEND but are outperformed by EEND-VC. The gap with respect to the latter is reduced when overlap-aware clustering methods are considered. SSGD is the most computationally demanding method, but it can be convenient if speech recognition also has to be performed. Its performance is close to that of SA-EEND but degrades significantly when the training and inference data characteristics are less matched.
    Comment: 52 pages, 10 figures
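
    The diarization accuracy such reviews report is typically the Diarization Error Rate (DER). A toy frame-level computation, not taken from the paper (real scoring also finds an optimal speaker-label mapping and applies a forgiveness collar, both omitted here):

    ```python
    # Toy DER on frame-level labels; 0 denotes silence, other ints are speakers.
    def der(reference, hypothesis):
        assert len(reference) == len(hypothesis)
        speech = [i for i, r in enumerate(reference) if r != 0]
        # Missed speech and speaker confusion, counted together on speech frames.
        errors = sum(1 for i in speech if hypothesis[i] != reference[i])
        # False alarms: hypothesis claims speech where the reference has silence.
        false_alarm = sum(1 for i, r in enumerate(reference)
                          if r == 0 and hypothesis[i] != 0)
        return (errors + false_alarm) / len(speech)

    ref = [0, 1, 1, 1, 2, 2, 0, 2]
    hyp = [0, 1, 1, 2, 2, 2, 1, 2]
    print(der(ref, hyp))  # (1 confusion + 1 false alarm) / 6 speech frames
    ```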

    Leveraging Speech Separation for Conversational Telephone Speaker Diarization

    Speech separation and speaker diarization have strong similarities, in particular with respect to end-to-end neural diarization (EEND) methods. Separation aims at extracting each speaker from overlapped speech, while diarization identifies the time boundaries of speech segments produced by the same speaker. In this paper, we analyse the use of speech separation guided diarization (SSGD), where diarization is performed simply by separating the speakers' signals and applying voice activity detection. In particular, we compare two speech separation (SSep) models in both offline and online settings. In the online setting we consider both continuous source separation (CSS) and causal SSep model architectures. As an additional contribution, we present a simple post-processing algorithm which significantly reduces the false alarm errors of an SSGD pipeline. We perform our experiments on the Fisher Corpus Part 1 and CALLHOME datasets, evaluating both separation and diarization metrics. Notably, without fine-tuning, our SSGD DPRNN-based online model achieves 12.7% DER on CALLHOME, comparable with state-of-the-art EEND models despite having considerably lower latency, i.e., 50 ms vs 1 s.
    Comment: Submitted to INTERSPEECH 202
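
    The SSGD recipe described here (separate, then run VAD on each channel) can be sketched in a few lines. The energy-based VAD, its threshold, and the synthetic "separated" signals are assumptions; in the paper the channels come from a DPRNN-style separation model and the VAD is learned.

    ```python
    # Sketch of SSGD: per-speaker voice activity detection on separated channels.
    import numpy as np

    def energy_vad(signal, frame_len=160, threshold=0.1):
        """Mark a frame active if its RMS energy exceeds a fixed threshold."""
        n = len(signal) // frame_len
        frames = signal[: n * frame_len].reshape(n, frame_len)
        rms = np.sqrt((frames ** 2).mean(axis=1))
        return rms > threshold

    rng = np.random.default_rng(3)
    sr = 8000
    # Toy "separated" outputs: speaker 1 active in the first second,
    # speaker 2 in the second one.
    spk1 = np.concatenate([rng.standard_normal(sr), 0.01 * rng.standard_normal(sr)])
    spk2 = np.concatenate([0.01 * rng.standard_normal(sr), rng.standard_normal(sr)])

    # Per-speaker frame activity = the diarization output of an SSGD pipeline.
    activity = np.stack([energy_vad(spk1), energy_vad(spk2)])
    print(activity.shape)
    ```

    The post-processing the abstract mentions would operate on `activity`, e.g. removing spurious short active regions that cause false alarms.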

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Speech Synthesis and Emotion: A Trade-off between Flexibility and Acceptability

    The article deals with the possibilities of emotional synthesis in text-to-speech systems. The synthesis of emotional speech is still an open question. The principal issue is how to introduce expressivity without compromising the naturalness of the synthetic speech produced by means of state-of-the-art technology. In this paper two concatenative synthesis systems are described and some approaches to address this topic are proposed. One of the reported proposals is to consider the intrinsic expressivity of certain speech acts, i.e. to exploit the correlation between affective states and communicative functions. This implies a different approach in the design of databases for speech synthesis. In fact, beyond phonetic and prosodic criteria, linguistic and pragmatic aspects should also be considered.